Introduction

In practice, it is easier to construct hashing algorithms which operate on relatively small, fixed input lengths, whilst still keeping the output length even smaller ( is still less than ). But hash functions are usually used on much larger inputs - for example, creating checksums for integrity verification of files. The Merkle-Damgård transform allows us to turn such a hash function, which operates on small fixed input lengths, into a hash function which operates on inputs of arbitrary lengths.

The Merkle-Damgård Construction

In particular, given a compression function which works with inputs of a "small", fixed input length and has outputs with length , the Merkle-Damgård transform allows us to use to a construct a hash function which takes messages of arbitrary length, denoted by , and produces digests of the same output length as .

The construction is similar to a block cipher in the sense that the message is chopped up into blocks. In contrast to block ciphers, however, this is done rather differently. Each block has length (since each block will be input into ), but it is not comprised entirely of message bits. Instead, each block contains message bits and the other bits () represent the so-called chaining variable for the current block.

This means that the message needs to be chopped up into message fragments , all of length . If the message length is not a multiple of , then the message is padded by appending a 1 to it and then appending 0s until the message length is short of a multiple of the fragment length by exactly the number of bits needed to encode the message length . The total length of the padding (including the 1, the 0s and the encoding of the message length) is denoted by .

When the message length is a multiple of the fragment length , padding still needs to be added. In a particular, an additional padding block is appended to the message, following the exact same procedure as before. The padding block begins with a and is followed by s - the last bits of the padding block again encode the message length .

The number of bits reserved for encoding the message length $l_{mb}$ is fixed for a given Merkle-Damgård construction. Usually $l_e$ is 64 bits, resulting in a maximum message length of $2^{64} - 1$, which is quite a reasonable limit.

After padding, the actual hash algorithm begins by appending an initialisation vector (IV) of length to the first message fragment . The IV is always a constant which is pre-defined in the specification of the Merkle-Damgård construction.

The SHA256 hash function uses the following 256-bit IV (the value is in hex):

$$IV := \texttt{0x6A09E667BB67AE853C6EF372A54FF53A510E527F9B05688C1F83D9AB5BE0CD19}$$

This initialisation vector serves as the initial chaining variable . The concatenation of the first message block with the IV is passed to the compression function , whose output becomes the next chaining variable. In general, the -th iteration takes the -th message block and appends to it the chaining variable . The chaining variable for the current stage is simply the output of from the previous iteration. The final output, i.e. the hash generated by the Merkle-Damgård function , is the final chaining variable.

Security of Merkle-Damgård Constructions

The reason why the Merkle-Damgård transform is used ubiquitously is the fact that it preserves collision resistance.

If the compression function $H'$ is collision resistant, then so is the Merkle-Damgård function $H$.

Suppose, towards contradiction that there is an efficient collision finder $\mathcal{A}$ which can find a collision in $H$ with non-negligible probability. Let $x$ and $x'$ be two inputs of lengths $L$ and $L'$, respectively, such that $H(x) = H(x')$. Let $x_1, x_2, ..., x_q$ be the $q$ blocks which $x$ is divided into, and, let $x_1', x_2', ..., x_{q'}'$ be the $q'$ blocks which $x'$ is divided into. Similarly, let $t_1, t_2, ..., t_q, t_{q+1}$ and $t_1', t_2', ..., t_{q'}', t_{q'+1}'$ be the chaining variables used at each iteration of the hashing of $x$ and $x'$, respectively (remember that the chaining variables $t_{q+1}$ and $t_{q'+1}'$ are also the output of $H$).

**Case 1:** If the two inputs have different lengths, i.e. $L \ne L'$, then the hash $H(x)$ is $t_{q+1} \coloneqq H'(x_q||t_{q})$ and the hash $H(x')$ is $t_{q'+1}' \coloneqq H'(x_q'||t_{q'}')$. However, $H(x) = H(x')$ means that $H'(x_q||t_{q}) = H'(x_q'||t_{q'}')$ which is a contradiction because $L\ne L'$ and so $x_q \ne x_{q'}'$ (remember that the length is appended to the message when padding) - we have found two different inputs which cause a collision in the collision resistant $H'$.

**Case 2:** If the two inputs have the same length, i.e. $L = L'$, then they are also divided into the same number of blocks $q = q'$. Let $I_i \coloneqq x_i||t_i$ denote the $i$-th input passed to $H'$ when computing $H(x)$, and let $I_i' \coloneqq x_i'||t_i'$ denote the $i$-th input passed to $H'$ when computing $H(x')$. Additionally, we will denote the output of $H(x)$ as $I_{q+1} \coloneqq t_{q+1}$, and we will denote the output of $H(x')$ as $I_{q'+1}' \coloneqq t_{q'+1}'$.

Now, $H(x) = H(x')$ and so $I_{q+1} = t_{q+1} = I_{q'+1}' = t_{q'+1}'$. This can only happen if $I_q = I_{q'}'$ or if $(I_q, I_{q'}')$ is a collision pair for $H'$ and the same logic propagates backwards - in general, $H'(I_i) = H'(I_i')$ can be true only if $I_i = I_i'$ or if $(I_i, I_{i'}')$ is a collision pair for $H'$. The inputs $x, x'$ are a collision pair for $H$ which means that $x \ne x'$ and so there *must* be some index $j$ for which $x_j \ne x_j'$ which means for sure that $I_j \ne I_j'$ and so $(I_j, I_j')$ turn out to be a collision in $H'$, which is a contradiction.